feat: add embedded Process Supervisor for unified process lifecycle #1370
thedotmack merged 10 commits into main
…anagement Consolidates scattered process management (ProcessManager, GracefulShutdown, HealthMonitor, ProcessRegistry) into a unified src/supervisor/ module. New: ProcessRegistry with JSON persistence, env sanitizer (strips CLAUDECODE_* vars), graceful shutdown cascade (SIGTERM → 5s wait → SIGKILL with tree-kill on Windows), PID file liveness validation, and singleton Supervisor API. Fixes #1352 (worker inherits CLAUDECODE env causing nested sessions) Fixes #1356 (zombie TCP socket after Windows reboot) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
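The env sanitizer mentioned in this commit can be pictured as a filter applied to the environment before a child process is spawned. The following is a minimal sketch under stated assumptions, not the PR's implementation: `sanitizeEnv` and the `Env` type are hypothetical names, and the prefix and exact-match values are the ones quoted in the review comments further down (the exact-match list is truncated there, so only its two visible entries are used).

```typescript
// Values as quoted in the review comments; the real exact-match list is longer (elided in the source).
const ENV_PREFIXES = ["CLAUDECODE_", "CLAUDE_CODE_"];
const ENV_EXACT_MATCHES = ["CLAUDECODE", "CLAUDE_CODE_SESSION"];

type Env = Record<string, string | undefined>;

// Hypothetical helper: returns a copy of `env` without the blocked keys.
function sanitizeEnv(env: Env): Env {
  const clean: Env = {};
  for (const key of Object.keys(env)) {
    const blocked =
      ENV_EXACT_MATCHES.indexOf(key) !== -1 ||
      ENV_PREFIXES.some((prefix) => key.startsWith(prefix));
    if (!blocked) clean[key] = env[key];
  }
  return clean;
}

// Usage: the parent's session markers are dropped, everything else is kept.
const childEnv = sanitizeEnv({ PATH: "/usr/bin", CLAUDECODE: "1", CLAUDECODE_ENTRYPOINT: "cli" });
console.log(Object.keys(childEnv));
```

Returning a copy rather than mutating the input keeps the parent's own environment untouched, which matters if the same process later spawns something that does need those variables.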
Adds reapSession(sessionId) to ProcessRegistry for killing session-tagged processes on session end. SessionManager.deleteSession() now triggers reaping. Tightens orphan reaper interval from 60s to 30s. Fixes #1351 (MCP server processes leak on session end) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
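Session-scoped reaping as described in this commit could look roughly like the sketch below. It is a simplification under assumptions: the record shape, the `ReapResult` type, and the injected `kill` and `isAlive` callbacks are all hypothetical, and the real code waits between signalling and re-checking, which is omitted here. The sketch reports confirmed survivors separately from the PIDs that were merely targeted.

```typescript
// Hypothetical record shape; the actual ProcessRegistry fields are not shown on this page.
interface ProcessRecord {
  pid: number;
  sessionId?: number;
}

interface ReapResult {
  targeted: number[];  // PIDs that were signalled
  survivors: number[]; // PIDs still alive after signalling
}

function reapSession(
  entries: Map<string, ProcessRecord>,
  sessionId: number,
  kill: (pid: number) => void,       // stands in for the SIGTERM/SIGKILL cascade
  isAlive: (pid: number) => boolean, // stands in for the liveness probe
): ReapResult {
  // Collect matching entries first so the map is not mutated while it is being iterated.
  const targets = Array.from(entries).filter(([, record]) => record.sessionId === sessionId);
  for (const [, record] of targets) kill(record.pid);

  const survivors: number[] = [];
  for (const [id, record] of targets) {
    if (isAlive(record.pid)) survivors.push(record.pid);
    else entries.delete(id); // only forget processes confirmed gone
  }
  return { targeted: targets.map(([, record]) => record.pid), survivors };
}
```

Keeping survivors in the registry means a later health sweep can retry them instead of losing track of a leaked process.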
Introduces socket-manager.ts for UDS-based worker communication, eliminating port 37777 collisions between concurrent sessions. Worker listens on ~/.claude-mem/sockets/worker.sock by default with TCP fallback. All hook handlers, MCP server, health checks, and admin commands updated to use socket-aware workerHttpRequest(). Backwards compatible — settings can force TCP mode via CLAUDE_MEM_WORKER_TRANSPORT=tcp. Fixes #1346 (port 37777 collision across concurrent sessions) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removes the fallback path where hook scripts started WorkerService in-process, making the worker a grandchild of Claude Code (killed by sandbox). Hooks now always delegate to ensureWorkerStarted() which spawns a fully detached daemon. Fixes #1249 (grandchild process killed by sandbox) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds 30-second periodic health sweep that prunes dead processes from the supervisor registry and cleans stale socket files. Adds /api/admin/doctor endpoint exposing supervisor state, process liveness, and environment health. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
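As an illustration of the kind of payload a diagnostic endpoint like this might assemble, here is a small sketch. Everything except the `deadProcessPids` field name (which appears in a later commit message) is an assumption: `ProcessInfo`, `DoctorReport`, `buildDoctorReport`, and the `processCount` and `healthy` fields are hypothetical.

```typescript
type ProcessStatus = "alive" | "dead";

// Hypothetical shapes for illustration only.
interface ProcessInfo {
  id: string;
  pid: number;
  status: ProcessStatus;
}

interface DoctorReport {
  processCount: number;
  deadProcessPids: number[];
  healthy: boolean;
}

function buildDoctorReport(processes: ProcessInfo[]): DoctorReport {
  // List the dead PIDs themselves instead of collapsing them into a single boolean.
  const deadProcessPids = processes
    .filter((p) => p.status === "dead")
    .map((p) => p.pid);
  return {
    processCount: processes.length,
    deadProcessPids,
    healthy: deadProcessPids.length === 0,
  };
}
```

Reporting the PID list keeps the endpoint useful for debugging, since an operator can see which registry entries the next sweep is expected to prune.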
64 tests covering all supervisor modules: process registry (18 tests), env sanitizer (8), shutdown cascade (10), socket manager (15), health checker (5), and supervisor API (6). Includes persistence, isolation, edge cases, and cross-module integration scenarios. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The socket-manager introduced UDS as default transport, but this broke the HTTP server's TCP accessibility (viewer UI, curl, external monitoring). Since there's only ever one worker process handling all sessions, the port collision rationale for UDS doesn't apply. Reverts to TCP-only, removing ~900 lines of unnecessary complexity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Code Review: feat: add embedded Process Supervisor for unified process lifecycle

Overall this is a solid, well-scoped PR. The supervisor architecture is clean, the test suite is comprehensive, and the env-sanitizer work is excellent. A few issues worth addressing before merge.

Critical

1. Race condition in signal handlers ( The shutdown handler is

// Both SIGTERM and SIGINT can hit here before stopPromise is assigned
if (this.stopPromise) {
  logger.warn('SYSTEM', `Received ${signal} but shutdown already in progress`);
  return;
}

Fix: Use a synchronous boolean flag set before the first

if (this.shutdownInitiated) return;
this.shutdownInitiated = true;
this.stopPromise = this.shutdownHandler(signal);
await this.stopPromise;

2. 50 iterations × 100ms = 5 seconds of polling per reap. There's also a redundant variable:

// Line ~3551: "remaining" is immediately computed from same data as "survivors"
const remaining = survivors.filter(r => isPidAlive(r.pid));

Consider replacing the poll loop with a brief

High

3. Duplicate env variable lists ( The env prefix/exact-match lists are duplicated in at least two places. Export them as constants from

// env-sanitizer.ts
export const ENV_PREFIXES = ['CLAUDECODE_', 'CLAUDE_CODE_'];
export const ENV_EXACT_MATCHES = ['CLAUDECODE', 'CLAUDE_CODE_SESSION', ...];

4.

const zombiePidFiles = processes.some(p => p.status === 'dead'); // loses which PIDs

This makes the endpoint much less useful for debugging. Return the actual dead PID list:

const deadProcesses = processes.filter(p => p.status === 'dead').map(p => p.pid);

5. Duplicate SIGTERM → SIGKILL cascade logic

6. Reap failure logged at DEBUG, not WARN In

} catch (error) {
  logger.debug('SESSION', 'Supervisor reapSession failed (non-blocking)', ...);
}

A process reaping failure during session cleanup is operationally significant. This should be at

Medium

7.

const sessionIdNum = typeof sessionId === 'number' ? sessionId : Number(sessionId) || undefined;

8.

const chromaProcess = (this.transport as unknown as { _process?: ChildProcess })._process;

This is fragile against MCP SDK updates. The existing

9. Hard-coded SIGTERM/SIGKILL timeouts

5s and 1s are reasonable defaults, but they can't be tuned without a code change. Consider pulling them from settings or at least defining them as named constants at the top of the file.

Tests

The 44-test suite is well-structured and the
Minor:
| Severity | Count | Must-fix before merge |
|---|---|---|
| Critical | 2 | Signal handler race, reapSession polling |
| High | 4 | Dup env lists, doctor endpoint info loss, dup cascade logic, log level |
| Medium | 3 | Session ID coercion, type cast, timeouts |
| Test gaps | 2 | Live-process reap test, signal handler test |
The architecture itself is sound — session-scoped reaping, persistent registry, env sanitization, and the /doctor endpoint are all good additions. The critical signal handler race and the observable issues with the doctor endpoint are the most important to resolve.
🤖 Generated with Claude Code
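The synchronous-flag fix suggested for the signal handler race can be sketched as follows. This is an illustrative reduction, not the Supervisor class itself: `SignalGuard` and `ShutdownHandler` are hypothetical names, while `shutdownInitiated` and `stopPromise` follow the identifiers quoted in the review.

```typescript
type ShutdownHandler = (signal: string) => Promise<void>;

// Hypothetical wrapper showing only the re-entry guard.
class SignalGuard {
  private shutdownInitiated = false;
  private stopPromise: Promise<void> | null = null;
  private readonly shutdownHandler: ShutdownHandler;

  constructor(shutdownHandler: ShutdownHandler) {
    this.shutdownHandler = shutdownHandler;
  }

  handleSignal(signal: string): Promise<void> {
    if (this.shutdownInitiated) {
      // A later signal joins the shutdown already in flight instead of starting another.
      return this.stopPromise ?? Promise.resolve();
    }
    // The flag is set synchronously, before anything is awaited,
    // so a second signal arriving in the same tick cannot slip past the check.
    this.shutdownInitiated = true;
    this.stopPromise = this.shutdownHandler(signal);
    return this.stopPromise;
  }
}
```

Because the check and the assignment happen in the same synchronous step, there is no window between testing the guard and setting it, which is the gap the review describes for the promise-based check.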
Remove unused `acceptingSpawns` field from Supervisor class (written but never read — assertCanSpawn uses stopPromise instead) and unused `buildWorkerUrl` import from context handler. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR Review: feat: add embedded Process Supervisor for unified process lifecycle

The direction is correct: centralizing process lifecycle in

Bugs

BUG 1:

BUG 2:

this.stopPromise = runShutdownCascade(...).finally(() => {
  this.started = false;
  this.stopPromise = null; // ← re-entry is now possible after resolve
});

After

BUG 3: Silent failure when

BUG 4: Windows SIGTERM does not kill subprocess trees

BUG 5: Map mutation during iteration in

for (const [id, info] of this.entries) {
  if (isPidAlive(info.pid)) continue;
  this.entries.delete(id); // mutation during iteration

The ES spec guarantees safety here, but it's a footgun for future readers. Collect IDs first, then delete in a second pass.

Race Conditions

RACE 1:

RACE 2:

Performance

PERF 1: Synchronous

Security

SEC 1:

SEC 2:

Error Handling

ERR 1:

ERR 2: Deprecated

ERR 3:

let timeoutId: ReturnType<typeof setTimeout>;
// ...
clearTimeout(timeoutId!); // non-null assertion

The Promise executor runs synchronously so this is correct today. Prefer
| Gap | Impact |
|---|---|
| validateWorkerPidFile 'stale' and 'invalid' branches never hit | Medium |
| reapSession never tested with live processes | Medium |
| Supervisor.start() / stop() not tested in isolation (only integration) | Medium |
| Windows taskkill / tree-kill paths untested | Low |
| pruneDeadEntries double-kill scenario (SIGKILL survivor) | Low |
Summary
The core architecture is sound and the signal handling + shutdown cascade is well-structured. The most actionable items are:
- Fix the `/api/admin/doctor` env-check list duplication (SEC 1)
- Fix Windows SIGTERM not killing subprocess trees (BUG 4)
- Fix silent MCP/Chroma PID registration failures (BUG 3)
- Add `pruneDeadEntries()` call before `reapSession` to avoid 5s waits on already-dead processes (RACE 2)
- Consider async/debounced `persist()` writes (PERF 1)
🤖 Generated with Claude Code
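The two-pass approach recommended for BUG 5 (collect IDs first, then delete) might be written as in the sketch below. The function name `pruneDeadEntries` appears in the review, but its signature here, the `RegistryEntry` shape, and the injected `isAlive` probe are assumptions made so the example stands alone.

```typescript
// Hypothetical entry shape; only the PID matters for pruning.
interface RegistryEntry {
  pid: number;
}

function pruneDeadEntries(
  entries: Map<string, RegistryEntry>,
  isAlive: (pid: number) => boolean,
): string[] {
  // Pass 1: record which IDs point at dead processes, without touching the map.
  const deadIds: string[] = [];
  entries.forEach((info, id) => {
    if (!isAlive(info.pid)) deadIds.push(id);
  });

  // Pass 2: delete them once iteration is over.
  for (const id of deadIds) entries.delete(id);
  return deadIds;
}
```

Deleting during `Map` iteration is well defined in JavaScript, so this is a readability choice rather than a correctness fix. Returning the removed IDs also gives callers something concrete to log.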
Code Review: feat: add embedded Process Supervisor for unified process lifecycle

Overall this is a well-structured PR. The separation of concerns across supervisor modules is clean, the graceful-degradation philosophy in hooks is correct, and the test suite for

Bugs

Medium: Signals sent to PID 0 go to the entire process group, so it should never be a valid managed process. Returning

Medium:

logger.info('SYSTEM', `Reaped ${sessionRecords.length} process(es)...`, { reaped: sessionRecords.length });

This logs (and returns) the count of intended kills, not confirmed terminations. If SIGKILL-resistant processes survive, monitoring and callers get a misleading count.

Medium:

export interface ShutdownCascadeOptions {
  dataDir?: string; // accepted everywhere, never used

Either wire it to the PID file path default or remove it before it confuses future contributors.

Low:

this.stopPromise = runShutdownCascade({...}).finally(() => {
  this.started = false;
  this.stopPromise = null; // ← nulled before outer await sees it
});
await this.stopPromise;

A second concurrent

Low: All hook requests default to the 3-second health-check timeout

// worker-utils.ts
const timeoutMs = options.timeoutMs ?? HEALTH_CHECK_TIMEOUT_MS; // 3000ms

Code Quality

const zombiePidFiles = processes.some(p => p.status === 'dead'); // checks registry, not PID file

This checks for dead entries in the process registry, not for a stale

Env-check logic in

// Server.ts lines 310-314: verbatim copy of sanitizer logic
const envPrefixes = ['CLAUDECODE_', 'CLAUDE_CODE_'];
const exactMatches = [...];

If the sanitizer's list is updated, the doctor endpoint will silently report a false "clean" state. Extract the key list to a shared constant.

SIGHUP daemon detection reads raw

if (process.argv.includes('--daemon')) { // fragile

If the flag name changes, SIGHUP suppression silently breaks. Consider reading this from the settings/config object.

Test Coverage Gaps

it('returns "missing" when PID file does not exist', () => {
  const status = validateWorkerPidFile({ logAlive: false });
  expect(['missing', 'alive', 'stale', 'invalid']).toContain(status);
});

This accepts all four possible return values, so it passes regardless of what the function does and catches no regressions. The test should set up a temp directory with no PID file and assert specifically

The tests check

No test for SIGKILL escalation path in

The "terminates children in reverse spawn order" test is good, but there's no test simulating a process that survives SIGTERM and requires the SIGKILL escalation. Worth adding given the two-phase logic.

Minor
Positive Notes
🤖 Generated with Claude Code
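The review's points about PID 0 and about asserting one specific PID-file status can be illustrated with a pure classifier that is easy to test branch by branch. This is a sketch under assumptions, not `validateWorkerPidFile` itself: `classifyPidFile` is a hypothetical helper, the PID file is assumed to hold a plain decimal PID (the real format is not shown on this page), and liveness is injected rather than probed. The four status strings are the ones listed in the quoted test.

```typescript
type PidFileStatus = "missing" | "invalid" | "stale" | "alive";

// `contents` is the PID file text, or null when the file does not exist.
function classifyPidFile(
  contents: string | null,
  isAlive: (pid: number) => boolean,
): PidFileStatus {
  if (contents === null) return "missing";

  const text = contents.trim();
  if (!/^\d+$/.test(text)) return "invalid";

  const pid = Number(text);
  // PID 0 addresses the caller's own process group, so it is rejected instead of probed.
  if (!Number.isSafeInteger(pid) || pid <= 0) return "invalid";

  return isAlive(pid) ? "alive" : "stale";
}
```

With the file contents and the liveness probe passed in, each branch can be asserted exactly (for example, `null` must yield `"missing"`), so no test has to accept all four values.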
Review Fixes Applied

Addressed the following issues from the three code reviews:

Must-Fix (7)
Should-Fix (3)
Test Results
🤖 Generated with Claude Code
…tignore, harden supervisor

- Downgrade request/response HTTP logging from info to debug to reduce noise
- Remove unused getWorkerPort imports, use buildWorkerUrl helper
- Export ENV_PREFIXES/ENV_EXACT_MATCHES from env-sanitizer, reuse in Server.ts
- Fix isPidAlive(0) returning true (should be false)
- Add shutdownInitiated flag to prevent signal handler race condition
- Make validateWorkerPidFile testable with pidFilePath option
- Remove unused dataDir from ShutdownCascadeOptions
- Upgrade reapSession log from debug to warn
- Rename zombiePidFiles to deadProcessPids (returns actual PIDs)
- Clean up gitignore: remove duplicate datasets/, stale ~*/ and http*/ patterns
- Fix tests to use temp directories instead of relying on real PID file

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Code Review: PR #1370, feat: add embedded Process Supervisor

Overall this is a well-structured PR. The supervisor layer is cleanly scoped, the migration from in-process fallback to

Bugs

[Medium] Double-unregister of

In

[Low] Dead import in

import { ensureWorkerRunning, getWorkerPort, workerHttpRequest } from '../../shared/worker-utils.js';

[Low]

async start(): Promise<void> {
  if (this.started) return;
  this.registry.initialize();
  const pidStatus = validateWorkerPidFile({ logAlive: false });
  if (pidStatus === 'alive') {
    throw new Error('Worker already running'); // started remains false, healthChecker never called
  }
  this.started = true;
  startHealthChecker();
}

If

Design Fragility

[Low] Private

const mcpProcess = (transport as unknown as { _process?: ChildProcess })._process;

This accesses a private implementation detail of

Minor Issues

Cosmetic: indentation in

await runShutdownCascade({
  registry,
  currentPid: process.pid,
    pidFilePath: path.join(tempDir, 'worker.pid') // extra indent
});

Cosmetic:

Test Coverage Notes

Coverage is solid overall. A few gaps worth noting:
Performance Notes
Summary
🤖 Generated with Claude Code
Summary
- `src/supervisor/`) with signal handling, process registry, shutdown cascade, and health checking
- `workerHttpRequest()` instead of in-process fallback (node→bun grandchild process gets SIGKILL'd in Claude Code sandbox #1249)
- `/api/admin/doctor` diagnostic endpoint for inspecting supervisor state

Note: Unix domain socket transport was added and then reverted in the final commit. TCP on port 37777 remains the only transport since there's only ever one worker process.
Test plan
- `bun test tests/supervisor/`: 44/44 pass
- `bun test tests/infrastructure/ tests/hooks/ tests/utils/claude-md-utils.test.ts`: 200/200 pass
- `curl http://localhost:37777/api/health`: 200 OK after build-and-sync
- `/api/admin/doctor` returns supervisor diagnostics

🤖 Generated with Claude Code